Data Introduction

A unicorn company, or unicorn startup, is a private company with a valuation of over $1 billion. As of March 2022, there were about 1,000 unicorns around the world. Well-known former unicorns include Airbnb, Facebook, and Google. Variants include the decacorn, valued at over $10 billion, and the hectocorn, valued at over $100 billion.

Data Link: https://www.kaggle.com/datasets/deepcontractor/unicorn-companies-dataset

Raw Data Quick View

There are 1037 data rows.

## [1] "Company"
## [1] "Valuation...B."
## [1] "Date.Joined"
## [1] "Country"
## [1] "City"
## [1] "Industry"
## [1] "Select.Inverstors"
## [1] "Founded.Year"
## [1] "Total.Raised"
## [1] "Financial.Stage"
## [1] "Investors.Count"
## [1] "Deal.Terms"
## [1] "Portfolio.Exits"

And 13 columns.

Data Cleaning

Define Important Function

For the data cleaning process, we define three helper functions.

NA Value Check

This function prints the number of NA values (stored as the string 'None' in the raw data) in each column.

numberOfNa <- function(df){
  flag <- 'None' # missing values are stored as the string 'None'
  for(i in 1:ncol(df)){
    temp <- df[, i] # extract each column in turn
    n <- sum(temp == flag) # count how many 'None' values
    print(paste(colnames(df)[i], n)) # print the column name and 'None' count
  }
}

Drop NA Value

This function drops every row whose value in the given column is NA ('None').

dropNone <- function(df, columnName){
  drop <- which(df[, columnName] == 'None')
  if(length(drop) > 0){ # guard: df[-integer(0), ] would drop every row
    df <- df[-drop, ]
  }
  return(df)
}

Data Type Check

This function prints the data type of each column.

checkType <- function(df){
  for(i in 1:ncol(df)){
    temp <- df[, i]
    print(paste(colnames(df)[i], '--->', typeof(temp)))
  }
}

Check NA For Each Column

The numbers below show how many NA values each column contains.

numberOfNa(df)
## [1] "Company 0"
## [1] "Valuation...B. 0"
## [1] "Date.Joined 0"
## [1] "Country 0"
## [1] "City 0"
## [1] "Industry 0"
## [1] "Select.Inverstors 17"
## [1] "Founded.Year 43"
## [1] "Total.Raised 24"
## [1] "Financial.Stage 988"
## [1] "Investors.Count 1"
## [1] "Deal.Terms 29"
## [1] "Portfolio.Exits 988"

Drop Data

First, drop the two columns with 988 NA values (‘Financial.Stage’ and ‘Portfolio.Exits’). Also drop the ‘Select.Inverstors’ column because of its redundancy.

Second, drop the rows that still contain NA values. For example, ‘Founded.Year’ has 43 NA values, so we drop all of those rows.

df <- df[, -which(colnames(df) %in% c('Financial.Stage', 'Portfolio.Exits'))] # drop the two mostly-NA columns
# 'Select.Inverstors' is a free-text list of investor names,
# which is hard to analyze and not important here
df <- df[, -which(colnames(df) %in% c('Select.Inverstors'))] # drop the redundant column
### Drop NA rows for each remaining column
df <- dropNone(df, 'Founded.Year') # drop NA in Founded.Year
df <- dropNone(df, 'Deal.Terms') # drop NA in Deal.Terms
df <- dropNone(df, 'Total.Raised') # drop NA in Total.Raised

Change Data Type

The output is the data type of each column before any processing.

## [1] "Company ---> character"
## [1] "Valuation...B. ---> character"
## [1] "Date.Joined ---> character"
## [1] "Country ---> character"
## [1] "City ---> character"
## [1] "Industry ---> character"
## [1] "Founded.Year ---> character"
## [1] "Total.Raised ---> character"
## [1] "Investors.Count ---> character"
## [1] "Deal.Terms ---> character"

Before doing any analysis, we have to clean the data value in each column.

For “Valuation...B.”, convert the string into a numeric value. E.g., “$5.3” –> 5.3

For “Total.Raised”, convert the string into a numeric value, using millions as the column’s unit. E.g., “$7.44B” –> 7440

For “Date.Joined”, split the string into three new numeric columns: dayJoin, monthJoin, and yearJoin. E.g., “4/7/2017” –> 4, 7, 2017 in three different columns.

For “Founded.Year”, “Deal.Terms”, and “Investors.Count”, convert the string into a numeric value. E.g., “2019” –> 2019
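A minimal sketch of these conversions, assuming the raw strings look like the examples above (the helper names and regular expressions here are our own, not the original script's):

```r
# Hypothetical conversion helpers for the formats shown above
toBillions <- function(x) as.numeric(gsub("\\$", "", x))  # "$5.3" -> 5.3

toMillions <- function(x) {                               # "$7.44B" -> 7440
  num <- as.numeric(gsub("[^0-9.]", "", x))               # strip "$" and the unit letter
  ifelse(grepl("B", x), num * 1000, num)                  # billions -> millions
}

splitDate <- function(x) {                                # "4/7/2017" -> 4, 7, 2017
  as.numeric(strsplit(x, "/")[[1]])
}

toBillions("$5.3")
toMillions("$7.44B")
splitDate("4/7/2017")
```

The same pattern extends to “Founded.Year” and the other numeric columns, where a plain `as.numeric()` suffices.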

After Cleaning

There are no NA values left in the dataset.

## [1] "Company 0"
## [1] "Valuation...B. 0"
## [1] "Date.Joined 0"
## [1] "Country 0"
## [1] "City 0"
## [1] "Industry 0"
## [1] "Founded.Year 0"
## [1] "Total.Raised 0"
## [1] "Investors.Count 0"
## [1] "Deal.Terms 0"
## [1] "dayJoin 0"
## [1] "monthJoin 0"
## [1] "yearJoin 0"

The data type of each column is now correct and useful.

## [1] "Company ---> character"
## [1] "Valuation...B. ---> double"
## [1] "Date.Joined ---> character"
## [1] "Country ---> character"
## [1] "City ---> character"
## [1] "Industry ---> character"
## [1] "Founded.Year ---> double"
## [1] "Total.Raised ---> double"
## [1] "Investors.Count ---> double"
## [1] "Deal.Terms ---> double"
## [1] "dayJoin ---> double"
## [1] "monthJoin ---> double"
## [1] "yearJoin ---> double"

After the cleaning, the number of rows has dropped from 1,037 to 962.

## [1] 962

These are the columns after deleting the redundant ones and creating the new date columns.

## [1] "Company"
## [1] "Valuation...B."
## [1] "Date.Joined"
## [1] "Country"
## [1] "City"
## [1] "Industry"
## [1] "Founded.Year"
## [1] "Total.Raised"
## [1] "Investors.Count"
## [1] "Deal.Terms"
## [1] "dayJoin"
## [1] "monthJoin"
## [1] "yearJoin"

Looking at a single row of data

Before

After

Analysis

Refining Data

For ease of analysis, we’ve decided to only look at countries with more than 20 unicorn companies.

tmp <- as.data.frame(table(df$Country))
tmp <- tmp[tmp$Freq > 20,]
df <- df[df$Country %in% tmp$Var1,]

Startups by Country

Startups by Industry

Analysis of Valuation

A majority of the startups’ valuations are clustered around $1 billion, which is unsurprising given the $1 billion cutoff for being considered a unicorn. Viewing all the companies together, we therefore see a significant number of outliers. This begs the question of how many of these higher-valued companies would still be considered outliers if the dataset also included companies valued under $1 billion. Still, we can separate the companies by country and look at the distributions of valuations through that lens.

Valuation separated by country.

Here, we see a similar but slightly different picture. Most notably, the countries with significantly more unicorn companies tend to have more of the very highly valued companies and, as a result, significantly more outliers. Interestingly, these highly valued outliers are still not numerous enough to shift the median significantly.

Valuation versus Money Raised

Next, let’s look further at these outliers by comparing each company’s valuation to the total money the company has raised. Because of the significant number of companies based in the United States, the US has an outsize influence on the regression line.
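The regression underlying this comparison can be sketched as a simple linear model. Since the cleaned data frame isn't reproduced here, this block uses a small made-up stand-in with the same two columns (valuation in $B, money raised in $M):

```r
# Illustrative only: toy stand-in for the cleaned data, with invented values
toy <- data.frame(
  Valuation    = c(1.0, 1.5, 2.0, 5.0, 10.0, 32.0),   # $B
  Total.Raised = c(150, 300, 450, 900, 2100, 8000)    # $M
)

# Regress money raised on valuation; the slope estimates how much
# additional funding is associated with each extra $1B of valuation
fit <- lm(Total.Raised ~ Valuation, data = toy)
coef(fit)
```

On the real data, heavily represented countries such as the US dominate this fit, which is why the text above flags the regression line's sensitivity to them.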

Interestingly though, the countries with fewer total companies seem to have companies that raise more money relative to each company’s valuation. This raises some interesting questions: Are these very highly valued companies actually able to raise the funds they need to be successful? Are the unicorns from countries with fewer total companies more likely to be successful, since they raise more money relative to their valuation? Or does this simply mean that many of the highly valued companies are overvalued?

Central Limit Theorem for Valuation

The Central Limit Theorem states that, for a sufficiently large sample size, the distribution of the sample means is approximately normal, regardless of the shape of the population distribution. As the sample size n grows, the sample means concentrate around the population mean, with a standard deviation of roughly the population standard deviation divided by the square root of n. Below are the distributions of 1,000 random samples each of sizes 10, 20, 30, and 40.

We saw earlier that the distribution of valuations was not normal and had a significant right skew. The skew is still clearly visible when the sample size is only 10, but it essentially disappears once the sample size grows large enough. Note also, in the output below, how the standard deviation of the sample means shrinks as the sample size increases.

## [1] "Sample Size =  10  Mean =  3.57208689822019  SD =  2.5457077984006"
## [1] "Sample Size =  20  Mean =  3.46541379098787  SD =  1.77535883477021"
## [1] "Sample Size =  30  Mean =  3.47917694964797  SD =  1.41146025849021"
## [1] "Sample Size =  40  Mean =  3.49287885883917  SD =  1.17044277917296"
## [1] "Total Size =   789 Mean =  3.46157160963245  SD =  7.96289444123421"
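The simulation above can be sketched as follows. Since the cleaned valuation column isn't reproduced here, this self-contained version substitutes a synthetic right-skewed population with a similar mean; only the structure of the loop mirrors the report's procedure:

```r
set.seed(42)
# Synthetic right-skewed stand-in for the valuation data (mean ~3.46)
population <- rexp(789, rate = 1 / 3.46)

# Draw `reps` samples of size n (with replacement) and record each mean
sampleMeans <- function(pop, n, reps = 1000) {
  replicate(reps, mean(sample(pop, n, replace = TRUE)))
}

for (n in c(10, 20, 30, 40)) {
  m <- sampleMeans(population, n)
  print(paste("Sample Size =", n, "Mean =", mean(m), "SD =", sd(m)))
}
# The SD of the sample means shrinks roughly like sd(population) / sqrt(n),
# matching the pattern in the output above
```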

Sampling

Here, we use sampling to obtain a representative portion of the population for further analysis. There are a variety of sampling methods; below we compare the subsets produced by a few of them.
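The first two methods can be sketched in base R (the report may use a dedicated package such as `sampling` for the stratified design; these base R equivalents are our own illustration, with the row count and sample size taken from the cleaned data above):

```r
set.seed(1)
N <- 962                 # rows in the cleaned data
n <- 80                  # desired sample size
idx <- seq_len(N)        # stand-in for the data's row indices

# Simple random sampling without replacement (SRSWOR)
srs <- sample(idx, n)

# Systematic sampling: every k-th unit after a random start
k <- floor(N / n)
start <- sample(k, 1)
sys <- seq(start, N, by = k)[1:n]

length(srs); length(sys) # both select n units
```

Stratified sampling additionally partitions the rows into strata (e.g., by country) and allocates the n units across strata in proportion to their sizes, as the stratum-by-stratum output below shows.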

Original Data

SRSWOR

Systematic Sampling

Stratified Sampling

## Stratum 1 
## 
## Population total and number of selected units: 143 14.49937 
## Stratum 2 
## 
## Population total and number of selected units: 23 2.332066 
## Stratum 3 
## 
## Population total and number of selected units: 23 2.332066 
## Stratum 4 
## 
## Population total and number of selected units: 61 6.185044 
## Stratum 5 
## 
## Population total and number of selected units: 39 3.954373 
## Stratum 6 
## 
## Population total and number of selected units: 500 50.69708 
## Number of strata  6 
## Total number of selected units 80

Sampling Conclusion

Total Data

SRSWOR

Systematic

Stratified